Improving Dependency Parsing with Interlinear Glossed Text and Syntactic Projection
Authors
Abstract
Producing annotated corpora for resource-poor languages can be prohibitively expensive, while obtaining parallel, unannotated corpora may be more easily achieved. We propose a method of augmenting a discriminative dependency parser using syntactic projection information. This modification will allow the parser to take advantage of unannotated parallel corpora where high-quality automatic annotation tools exist for one of the languages. We use corpora of interlinear glossed text—short bitexts commonly found in linguistic papers on resource-poor languages with an additional gloss line that supports word alignment—and demonstrate this technique on eight different languages, including resource-poor languages such as Welsh, Yaqui, and Hausa. We find that incorporating syntactic projection information in a discriminative parser generally outperforms deterministic syntactic projection. While this paper uses small IGT corpora for word alignment, our method can be adapted to larger parallel corpora by using statistical word alignment instead.
Similar resources
Enhanced and Portable Dependency Projection Algorithms Using Interlinear Glossed Text
As most of the world’s languages are under-resourced, projection algorithms offer an enticing way to bootstrap the resources available for one resource-poor language from a resource-rich language by means of parallel text and word alignment. These algorithms, however, make the strong assumption that the language pairs share common structures and that the parse trees will resemble one another. Th...
Extracting Interlinear Glossed Text from LaTeX Documents
We present texigt, a command-line tool for the extraction of structured linguistic data from LaTeX source documents, and a language resource that has been generated using this tool: a corpus of interlinear glossed text (IGT) extracted from open access books published by Language Science Press. Extracted examples are represented in a simple XML format that is easy to process and can be used to va...
Enriching Interlinear Text using Automatically Constructed Annotators
In this paper, we will demonstrate a system that shows great promise for creating Part-of-Speech taggers for languages with little to no curated resources available, and which needs no expert involvement. Interlinear Glossed Text (IGT) is a resource which is available for over 1,000 languages as part of the Online Database of INterlinear text (ODIN) (Lewis and Xia, 2010). Using nothing more tha...
Enriching, Editing, and Representing Interlinear Glossed Text
The majority of the world’s languages have little to no NLP resources or tools. This is due to a lack of training data (“resources”) over which tools, such as taggers or parsers, can be trained. In recent years, there have been increasing efforts to apply NLP methods to a much broader swathe of the world’s languages. In many cases this involves bootstrapping the learning process with enriched or...
Automatic Semantic Role Labeling in Persian Sentences Using Dependency Trees
Automatically identifying the words that carry semantic roles (such as Agent, Patient, Source, etc.) in a sentence and assigning the correct roles to them may improve many natural language processing tasks, including information extraction, question answering, text summarization and machine translation. Semantic role labeling systems usually take advantage of syntactic parsing and th...